
Fix Cloudberry restore point test race condition in WAL archiving #35

Draft
Copilot wants to merge 4 commits into master from copilot/fix-cloudberry-test-issue

Conversation

Copilot AI commented Feb 9, 2026


The Cloudberry test fails when checking for archived WAL files after the create-restore-point commands. The test expected ≥2 WAL files per segment but found only 1 for the coordinator (seg-1).

Root Cause

create-restore-point triggers pg_switch_wal() to close the current WAL segment, but PostgreSQL's archiver executes archive_command asynchronously. The fixed 5-second sleep was not long enough for the archiver to finish before the test verified the upload.

Changes

Retry loop with fixed-interval polling (60 attempts × 5s intervals):

  • Polls S3 until ≥2 WAL files present or 5min timeout
  • Progress logging on each attempt
  • Reduces S3 API overhead vs tight polling
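
The polling behaviour listed above can be sketched as a generic helper. This is only an illustration under stated assumptions: `poll_until` is a hypothetical name that does not appear in the PR, and the log wording is invented for the example. It retries a command at a fixed interval and writes a progress line to stderr on each attempt.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): poll a command at a fixed
# interval until it succeeds or the attempt budget is exhausted.
# Usage: poll_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
poll_until() {
    local max_attempts=$1 interval=$2
    shift 2
    local attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the condition command in the current shell.
        if "$@"; then
            echo "attempt $attempt/$max_attempts: condition met" >&2
            return 0
        fi
        echo "attempt $attempt/$max_attempts: not ready, sleeping ${interval}s" >&2
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    return 1  # timed out
}
```

With 60 attempts and a 5-second interval this gives the same 5-minute ceiling as the PR's loop, for example `poll_until 60 5 some_check`, where `some_check` stands for whatever condition the test needs.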

Delay between restore point creations:

  • 2s sleep after each create-restore-point
  • Prevents archiver queue saturation when creating rp1 and rp2 in rapid succession
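check_wal_upload() {
    local path=$1
    local max_attempts=60  # 60 attempts × 5s sleep = 5 minute timeout
    local attempt=1        # must be initialized before the loop test
    local count

    while [ "$attempt" -le "$max_attempts" ]; do
        # Count listed objects; count+0 prints 0 when nothing matches
        count=$(wal-g st ls "$path" | awk '/^obj/ {count++} END {print count+0}')
        [ "$count" -ge 2 ] && return 0
        sleep 5
        attempt=$((attempt + 1))
    done
    return 1
}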
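
To see why the awk filter ends with `print count+0`, the counting step can be exercised on its own. The sketch below is an assumption-laden illustration: `count_objects` is a hypothetical wrapper, and the sample listing is made up to match only the `/^obj/` pattern used above, not the real `wal-g st ls` output format.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the awk filter from check_wal_upload:
# counts input lines that start with "obj".
count_objects() {
    awk '/^obj/ {count++} END {print count+0}'
}

# Made-up listing; only the leading "obj"/"dir" tokens matter here.
sample_listing='obj first_wal_file
dir some_directory
obj second_wal_file'

printf '%s\n' "$sample_listing" | count_objects

# On empty input, count is unset; count+0 forces a numeric 0 so that
# a later [ "$count" -ge 2 ] comparison never sees an empty string.
printf '' | count_objects
```

Without the `+0`, awk would print an empty line when nothing matches, and the `-ge` comparison in the loop would raise an "integer expression expected" error instead of simply evaluating to false.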
Original prompt

This section details the original issue you should resolve

<issue_title>[BUG] cloudberry test failed</issue_title>
<issue_description>### Database name

cloudberry

WAL-G Version

master

Describe your problem

wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Starting gpstop with args: -a -M fast
wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Gathering information and validating the environment...
wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information
wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Obtaining Segment details from coordinator...
wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Apache Cloudberry) 2.1.0-incubating build dev'
wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Commencing Coordinator instance shutdown with mode='fast'
wal-g_cloudberry_tests | 20260209:12:09:39:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Coordinator segment instance directory=/usr/local/gpdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
wal-g_cloudberry_tests | 20260209:12:09:40:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Attempting forceful termination of any leftover coordinator process
wal-g_cloudberry_tests | 20260209:12:09:40:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Terminating processes for segment /usr/local/gpdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1
wal-g_cloudberry_tests | 20260209:12:09:40:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-No standby coordinator host configured
wal-g_cloudberry_tests | 20260209:12:09:40:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Targeting dbid [2, 3, 4] for shutdown
wal-g_cloudberry_tests | 20260209:12:09:40:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Commencing parallel segment instance shutdown, please wait...
wal-g_cloudberry_tests | 20260209:12:09:40:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-0.00% of jobs completed
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-100.00% of jobs completed
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-----------------------------------------------------
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:- Segments stopped successfully = 3
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:- Segments with errors during stop = 0
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-----------------------------------------------------
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Successfully shutdown 3 of 3 segment instances
wal-g_cloudberry_tests | 20260209:12:09:42:141081 gpstop:0530ff5b9cb9:gpadmin-[INFO]:-Database successfully shutdown with no errors reported
wal-g_cloudberry_tests | + start_cluster
wal-g_cloudberry_tests | + /usr/local/gpdb_src/bin/gpstart -a -t 180
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Starting gpstart with args: -a -t 180
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Gathering information and validating the environment...
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Cloudberry Binary Version: 'postgres (Apache Cloudberry) 2.1.0-incubating build dev'
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Cloudberry Catalog Version: '302502091'
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Starting Coordinator instance in admin mode
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-CoordinatorStart pg_ctl cmd is env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /usr/local/gpdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1 -l /usr/local/gpdb_src/gpAux/gpdemo/datadirs/qddir/demoDataDir-1/log/startup.log -w -t 180 -o " -p 7000 -c gp_role=utility " start
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Obtaining Segment details from coordinator...
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Setting new coordinator era
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Coordinator Started...
wal-g_cloudberry_tests | 20260209:12:09:42:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Shutting down coordinator
wal-g_cloudberry_tests | 20260209:12:09:46:141423 gpstart:0530ff5b9cb9:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait...
...



Copilot AI and others added 3 commits February 9, 2026 18:40
…int test

Co-authored-by: chipitsine <2217296+chipitsine@users.noreply.github.com>
Co-authored-by: chipitsine <2217296+chipitsine@users.noreply.github.com>
Co-authored-by: chipitsine <2217296+chipitsine@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix cloudberry test failures in wal-g" to "Fix Cloudberry restore point test race condition in WAL archiving" Feb 9, 2026
Copilot AI requested a review from chipitsine February 9, 2026 18:43
@chipitsine chipitsine force-pushed the master branch 2 times, most recently from 3633c51 to 81395bd Compare May 7, 2026 21:35


Development

Successfully merging this pull request may close these issues.

[BUG] cloudberry test failed

2 participants